Motivation

Wildfires are considered a national threat: they can devour everything in their path, including trees, homes and even lives, and can spread for miles within a few minutes. The aftermath of deadly wildfires, including long-lasting threats to human health and the ecosystem, is deeply worrying and requires continual public attention. According to the U.S. Fire Service, more than 700 wildfires occur every year, burning approximately 7 million acres of land and destroying more than 26,000 structures. The U.S. government spends over 5 billion dollars each year fighting these fires, yet accurate prediction and effective management of wildfires remain elusive.

Wildfires in the U.S.

Here are some facts about wildfires in the U.S.

  • Cause
    According to the U.S. Department of the Interior, about 90 percent of wildfires in the United States are started by humans, through unattended campfires, debris burning, downed electrical equipment and power lines, negligently discarded cigarettes and intentional acts of arson. Natural causes (lightning or lava), together with climate change, are found to be responsible for the remaining 10 percent. Verisk’s 2017 Wildfire Risk Analysis shows that 4.5 million U.S. homes are currently identified as being at high or extreme risk of wildfire, more than 2 million of them in California.

  • Loss
    In October 2019, significant fires broke out in California, leading to the evacuation of over 200,000 people and the declaration of a state of emergency. The Kincade Fire in Sonoma County burned over 76,000 acres. The Getty Fire in Los Angeles caused over 7,000 residences to be placed in a mandatory evacuation zone.

    In 2018, there were 58,083 wildfires, which burned 8.8 million acres.

    In 2017, there were 71,499 wildfires, which burned over 10 million acres.

  • Policy
    The governmental strategy on wildfires has shifted from a primary focus on suppression to multiple, comprehensive measures with long-term sustainability. Current policies aim to manage wildfire risk through the reduction of hazardous fuels, restoration of the ecosystem and assistance from communities. For instance, the federal government has formulated and implemented policies that protect funds dedicated to forest management and restoration and that expedite small-scale forest management research projects.

Goals of our Project

Considering these massive, devastating wildfires and their catastrophic effects on human communities and the environment, we decided to investigate the following research questions to discover whether there are any clear trends and to learn how to better prevent and control wildfires. We also build a linear regression model to assess the relationship between fire duration and some potential risk factors. In this project, we examine:

  • The frequency of wildfires across states over time from 2005 to 2015 and the top ten states with the largest number of wildfires in the ten-year period

  • The geographical distribution of wildfires across states, measured by the number of wildfires per square mile

  • The most common causes of wildfires

  • The relationship between duration and fire size

  • The association among a number of factors and the duration of wildfires in Riverside, California and Dallas, Texas

Initial Questions

At first, we were interested in exploring the correlation between wildfire occurrence/severity and a number of factors, such as location, time and weather. In particular, we wanted to find out whether a predictive model could be fitted to our data. However, because the information in our dataset is quite limited (with many missing values and some wrong values), we failed to produce a model that predicts wildfire occurrence/severity at the national level with high accuracy and precision. As a result, we narrowed our focus to two counties, Riverside, California and Dallas, Texas, which had the highest number of the most destructive wildfires in 2015, and attempted to build a regression model to examine the association between the duration of wildfires and other factors such as weather, size and cause. Although we narrowed the scope of our research on predicting wildfires, we completed all of the exploratory data analysis that we proposed at the beginning of this project.

Data

Data resources

  • state and county datasets in package ggplot2

    The state and county datasets are used to plot the maps in the exploratory analysis and the shiny apps.

  • state.x77 dataset

    We use the Area variable to calculate the number of wildfires per square mile.

  • NOAA Weather Data

    We use this data to find the association among duration, number of fires and temperature in Riverside, California.

  • 1.88 Million US Wildfires: 24 years of geo-referenced wildfire records

    This data publication contains a spatial database of wildfires that occurred in the United States from 1992 to 2015. This dataset includes 1.88 million geo-referenced wildfire records, representing a total of 140 million acres burned during the 24-year period. We mainly focus on the time period from 2005 to 2015 and the following core data elements: discovery and control date, final fire size, causes of wildfires and a point location.

    The variables we used for analysis are:

    • duration = Burning time in hours, calculated from discovery_date, discovery_time, cont_date and cont_time.

    • FIRE_SIZE_CLASS = Code for fire size based on the number of acres within the final fire perimeter expenditures (A=greater than 0 but less than or equal to 0.25 acres, B=0.26-9.9 acres, C=10.0-99.9 acres, D=100-299 acres, E=300 to 999 acres, F=1000 to 4999 acres, and G=5000+ acres).

    • STAT_CAUSE_DESCR = Description of the cause of the fire.
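
As a rough illustration of how the duration variable above is defined, the sketch below computes burning time in hours from two Julian-day dates and two "HHMM" time strings. The helper name `fire_duration_hours` and the sample values are hypothetical and are not part of the report's code; the sketch assumes times are four-character strings with a leading zero where needed.

```r
# Hypothetical helper (not from the report): burning time in hours from
# Julian-day dates and "HHMM" time strings.
fire_duration_hours <- function(disc_date, disc_time, cont_date, cont_time) {
  # convert an "HHMM" string to hours since midnight
  to_hours <- function(hhmm) {
    as.numeric(substr(hhmm, 1, 2)) + as.numeric(substr(hhmm, 3, 4)) / 60
  }
  # whole days between the two dates, plus the difference in clock time
  (cont_date - disc_date) * 24 + to_hours(cont_time) - to_hours(disc_time)
}

# Discovered at 23:30, contained at 01:15 on the following Julian day
example_duration <- fire_duration_hours(2457023.5, "2330", 2457024.5, "0115")
example_duration  # 1.75 hours
```

The clock-time difference can be negative (here 1.25 - 23.5), which is why the day difference has to be added before interpreting the result.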

Read and clean data

  • Read raw data
## load the packages used in the data steps below
library(tidyverse)
library(DBI)
library(RSQLite)

## read sqlite raw file, 758.9Mb
raw = dbConnect(SQLite(), "./data/FPA_FOD_20170508.sqlite")
fires = tbl(raw, "Fires") %>% collect()
dbDisconnect(raw)

fires = fires %>% 
  janitor::clean_names() %>% 
  select(fire_year, discovery_date, discovery_time, stat_cause_descr,
         cont_date, cont_time, fire_size, fire_size_class,
         latitude, longitude, state, county, fips_code, fips_name)

## restrict the data to the years 2005 through 2015

fire_0515 = fires %>% 
  filter(fire_year %in% 2005:2015)

## save the dataframe into a csv file

write_csv(fire_0515, "./data/fire_0515.csv")
  • Clean data
fire = read_csv("./final_report/data/fire_0515.csv")

# split the HHMM time strings into hour and minute components
tidy_fire = 
  fire %>% 
  separate(cont_time, into = c("cont_hour","cont_min") ,sep = 2) %>% 
  separate(discovery_time, into = c("disc_hour","disc_min") ,sep = 2) %>% 
  mutate(cont_hour = as.numeric(cont_hour),
         cont_min = as.numeric(cont_min),
         disc_hour = as.numeric(disc_hour),
         disc_min = as.numeric(disc_min))

# add DC and Puerto Rico to the built-in state lookup vectors
state.abb = append(state.abb, c("DC", "PR"))
state.name = append(state.name, c("District of Columbia", "Puerto Rico"))

# calculate duration
tidy_fire = 
  tidy_fire %>% 
  # convert Julian day numbers to calendar dates (Julian day 2458014.5 is 2017-09-18)
  mutate(discovery_date = as.Date(discovery_date - 2458014.5, origin = '2017-09-18'),
         cont_date = as.Date(cont_date - 2458014.5, origin = '2017-09-18'),
         duration_day = as.numeric(difftime(cont_date, discovery_date, units = "days"))) %>% 
  mutate(
    duration_hour = cont_hour - disc_hour,
    duration_min = cont_min - disc_min,
    duration = duration_day * 24 + duration_hour + duration_min / 60
  ) %>% 
  select(-duration_day, -duration_hour,-duration_min) %>% 
  mutate(fips_name = tolower(fips_name),
         state = fct_inorder(state),
         fire_size_class = fct_inorder(fire_size_class),
         region = state.name[match(state, state.abb)],
         stat_cause_descr = as.factor(stat_cause_descr),
         stat_cause_descr = relevel(stat_cause_descr, ref = "Missing/Undefined"))
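
The per-square-mile measure described under Data resources can be sketched as follows. This is a minimal example under stated assumptions: the counts in `fire_counts` are made-up toy numbers, not results from this report, and the column names are illustrative. Only `state.x77` and its Area column (land area in square miles) come from base R's datasets package.

```r
# Toy counts per state (hypothetical numbers, for illustration only)
fire_counts <- data.frame(
  region  = c("Texas", "New York", "California"),
  n_fires = c(1000, 500, 800)
)

# Land area in square miles from the built-in state.x77 matrix
state_area <- data.frame(
  region     = rownames(state.x77),
  area_sq_mi = unname(state.x77[, "Area"])
)

# Join counts to areas and compute fires per square mile
fire_density <- merge(fire_counts, state_area, by = "region")
fire_density$fires_per_sq_mi <- fire_density$n_fires / fire_density$area_sq_mi

# Rank states from highest to lowest density
fire_density <- fire_density[order(-fire_density$fires_per_sq_mi), ]
fire_density
```

Because state.x77 covers only the 50 states, an inner join like this one drops the District of Columbia and Puerto Rico.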

Exploratory Analysis

First, we would like to see whether there are any noticeable trends over time or noticeable differences between states. We would also like to know which cause was the most common during the 10 years. Finally, we would like to explore whether fire size and burning duration are related.

Causes of Wildfires

  • Word cloud of causes

The figure below is a word cloud of wildfire causes during the 10 years. Debris Burning is the top cause, which suggests that people may need to pay more attention to debris burning, as it often causes serious wildfires.

library(wordcloud)
library(RColorBrewer)

# count fires per cause (stored under a new name so that `fire` is not overwritten)
cause_count = 
  fire %>% 
  group_by(stat_cause_descr) %>% 
  summarise(n_cause = n())
set.seed(555)

wordcloud(words = cause_count$stat_cause_descr, freq = cause_count$n_cause, scale = c(3, .8), min.freq = 1,
          max.words=200, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))

title( "Fig 6: Word cloud of wildfire causes during 10 years")

  • Cause ranking

The figure below shows the ranking of causes. We counted the number of fires attributed to each cause and ranked the causes by count. Debris Burning is clearly the most common cause.

rank_cause = tidy_fire %>% 
  group_by(stat_cause_descr) %>% 
  summarize(count = n()) %>% 
  mutate(stat_cause_descr =fct_reorder(stat_cause_descr, count)) %>% 
  mutate(cause = stat_cause_descr) %>% 
  select(-stat_cause_descr) %>% 
  ggplot(aes(x = cause, y = count)) +
  geom_bar(stat = "identity", aes(fill = cause), alpha=.6, width=.4) +
  coord_flip() +
  labs(x = "", y = "Number of Fires", title = "Fig 7: Wildfire Counts in the U.S. by Causes from 2005 to 2015") +
  viridis::scale_fill_viridis(discrete = TRUE) + theme_bw() + theme(legend.position = "none")

ggplotly(rank_cause)

Association between Duration and Fire Size

The figure below shows the distribution of duration (less than 48 hours) for each fire size class. We restrict the duration to make the trend clearer. The curves are always concentrated around duration = 0. This is likely due to information bias: many records have a duration equal to zero, and we cannot tell what actually happened in those cases.

Setting aside the records with duration = 0, fires in class A, the smallest size class, had shorter durations than fires in larger classes such as F or G. We therefore assumed that there is some relationship between burning duration and fire size, and we modeled it in our analysis.

size_duration = 
  tidy_fire %>% 
  mutate(fire_size_class = fct_relevel(fire_size_class, c("A", "B", "C", "D", "E", "F", "G"))) %>% 
  drop_na(duration) %>% 
  filter(duration < 48) %>%
  filter(duration != 0) %>%  
  ggplot(aes(x = duration, fill = fire_size_class)) +
  geom_density(alpha = 0.4) +
  labs(x = "Duration (hours)", fill = "Fire Size Class",
       title = "Fig 8: Distribution of duration for different fire size class") +
  theme_bw() 

ggplotly(size_duration)
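
One way to gauge how much the zero-duration records matter is to tabulate their share within each size class. The sketch below does this on a small toy data frame; the values in `toy_fires` are hypothetical and only mimic the structure of the duration and fire_size_class columns, so the printed shares say nothing about the real data.

```r
# Toy records (hypothetical values): duration in hours and fire size class
toy_fires <- data.frame(
  fire_size_class = c("A", "A", "A", "B", "B", "G"),
  duration        = c(0, 0, 1.5, 0, 3, 40)
)

# Share of records with duration exactly zero, per size class
zero_share <- tapply(toy_fires$duration == 0, toy_fires$fire_size_class, mean)
zero_share
```

Running the same tabulation on the cleaned data would show whether the zero-duration records are spread evenly across classes or concentrated in the smallest ones.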

Statistical Analysis

Results and Discussion

In general, the number of wildfires in the U.S. decreased from 2005 to 2015. Wildfires are more likely to happen from February to August. Texas had the highest number of wildfires during 2005 - 2015, while, surprisingly, New York had the highest number of wildfires per square mile. The latter measure is more interesting because it allows us to compare wildfires across states more directly. In addition, we find the most common cause to be debris burning. Wildfires with the largest burned areas also have the longest durations.

We recognize that wildfires have various causes and that their severity (size, duration, etc.) differs by a number of factors; thus, the final results that we present here may not be as accurate and comprehensive as we had hoped. Furthermore, as climate change continues to intensify wildfires, it is everyone’s responsibility to understand and learn how to prevent wildfires and to protect nature and ourselves. In light of the above, we shall continue our quest to predict future wildfires accurately and in a timely manner, using extensive datasets with advanced predictive modelling techniques.